In this practical session, we will explore how to perform typical tasks associated with image retrieval. Students will be able to download this IPython/Jupyter notebook after the class in order to perform the experiments also at home.
Link to the slides: PDF
In order to follow this tutorial, please create a new copy of this notebook and name the copy with your own name. Do not run this notebook directly: it is read-only for students, so your changes could not be saved. To start following this tutorial:
In order to run a cell, select a cell and press 'shift + enter'
During the execution of this notebook, you will find some questions that need to be answered. Please write your answers in a separate text file and send it to us by e-mail at rafael.sampaio-de-rezende@naverlabs.com. If you worked together with other people during the practical session, you can send a single answer file for two or three people.
We start by importing the necessary modules and fixing a random seed. Please select the cell below and press 'shift+enter':
import numpy as np
from numpy.linalg import norm
import torch
from torch import nn
import json
import pdb
import sys
import os.path as osp
import pandas as pd
from PIL import Image
import warnings
from datasets import create
from archs import *
from utils.test import extract_query
from utils.tsne import do_tsne
np.random.seed(0)
print('Ready!')
Now, let's start by instantiating the Oxford dataset, which we will use in all the following experiments.
# create Oxford 5k database
dataset = create('Oxford')
We can now query for some aspects of this dataset, such as the number of images, number of classes, the name of the different classes, and the class label for each of the images in the dataset:
print('Dataset: ' + dataset.dataset_name)
print()
labels = dataset.get_label_vector()
classes = dataset.get_label_names()
print('Number of images: ' + str(labels.shape[0]))
print('Number of classes: ' + str(classes.shape[0]))
print()
print('Class names: ' + str(classes))
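To see how these two arrays fit together, here is a minimal sketch of counting how many images each class contains. The `labels` and `classes` values below are toy stand-ins for what `get_label_vector()` and `get_label_names()` return, not the actual Oxford data:

```python
import numpy as np

# Hypothetical stand-ins for the values returned by get_label_vector() and
# get_label_names() -- NOT the real Oxford labels.
labels = np.array([0, 0, 1, 2, 1, 0, 2, 2, 2])
classes = np.array(['all_souls', 'ashmolean', 'balliol'])

# Count how many images belong to each class.
values, counts = np.unique(labels, return_counts=True)
for v, c in zip(values, counts):
    print('{}: {} images'.format(classes[v], c))
```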
Now, let's load a list of models we can use in this tutorial:
# load the dictionary of the available models and features
with open('data/models.json', 'r') as fp:
models_dict = json.load(fp)
pd.DataFrame(models_dict).T # show the loaded models onscreen
In this first part of the tutorial, we will study how different changes in the training pipeline (e.g. choice of model, pooling, and post-processing options) can change the quality of results we obtain.
As a first step, we will be creating a neural network implementing the AlexNet architecture to use in our experiments.
# instantiate the model for the first experiment
model_1a = alexnet_imagenet()
# show the network details
print(model_1a)
Now, we could use this model to extract features for all images in our dataset. To make this faster, we have already precomputed those features and stored them on disk.
To load the features computed by this model from disk, run the cell below:
dfeats = np.load(models_dict['alexnet-cls-imagenet-fc7']['dataset'])
pd.DataFrame(dfeats)
Question 1: What does each row of the matrix dfeats represent?
Question 2: Where does the dimension of these rows come from, and how do we extract these features?
Hint: if you do not know the answers for the questions above, try running the following command:
model_1a_test = alexnet_imagenet_fc7(); print(model_1a_test)
Now, assuming that we have already used our network to extract features from all images in the dataset and stored them in the matrix dfeats (as done above), we will retrieve the top-15 images that are most similar to a query image. In our example, we will use the following image as a query:
q_idx = 11 # feel free to switch to another number afterwards, but test first with 11
# visualize top results for a given query
dataset.vis_top(dfeats, q_idx, ap_flag=True)
To the right of the query image, we plot the best retrieval results, in decreasing order of similarity from left to right. Images in green frames are true matches, images in red frames are false matches, and images in gray frames are so-called 'junk' matches (images of the same landmark, but taken from a too different angle or of the wrong spot). Junk matches are ignored during the calculation of the AP (average precision).
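Under the hood, a function like `vis_top` presumably ranks the database images by descriptor similarity and scores the ranking with average precision (it also draws the framed images, which we skip here). The following is a minimal sketch of such a ranking and AP computation, assuming cosine similarity on L2-normalized features and the junk-handling described above; the exact implementation in `utils` may differ:

```python
import numpy as np

def rank_and_ap(dfeats, q_idx, positives, junk, topk=15):
    """Rank database images by cosine similarity to a query taken from the
    same feature matrix, and compute average precision while skipping junk."""
    feats = dfeats / np.linalg.norm(dfeats, axis=1, keepdims=True)
    sims = feats @ feats[q_idx]                  # cosine similarity to the query
    order = np.argsort(-sims)                    # most similar first
    order = order[order != q_idx]                # do not retrieve the query itself
    order = np.array([i for i in order if i not in junk])   # ignore junk matches
    hits = np.isin(order, list(positives))
    precisions = np.cumsum(hits) / (np.arange(len(order)) + 1)
    ap = float((precisions * hits).sum() / max(hits.sum(), 1))
    return order[:topk], ap
```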
Now we will use the t-SNE algorithm to cluster images together according to feature similarity:
do_tsne(dfeats, labels, classes, sec='1a')
Question 3: What can we observe from the t-SNE visualization? Which classes 'cluster' well? Which do not?
Now, we will see what happens when we fine-tune our off-the-shelf ImageNet network on the Landmarks dataset and then repeat the process above.
We can quickly compare some examples of images from both training datasets.
Image.open('figs/imagenet_ex.png')
Image.open('figs/lm_ex.png')
Question 4: Should we get better results? What should change? Why?
model_1b = alexnet_lm() # instantiate the model that has been fine-tuned on Landmarks
print(model_1b) # show the network details
Compare with the model we had before:
print(model_1a)
Question 5: Why do we change the last layer of the AlexNet architecture?
Question 6: How do we initialize the layers of model_1b for finetuning?
Let's now repeat the same process we had done before, but now using image features that have been extracted using the fine-tuned network.
dfeats = np.load(models_dict['alexnet-cls-lm-fc7']['dataset'])
pd.DataFrame(dfeats)
Visualize the top-15 most similar images:
dataset.vis_top(dfeats, q_idx, ap_flag=True)
do_tsne(dfeats, labels, classes, sec='1b')
Question 7: How does the visualization change after fine-tuning? What about the top results?
Question 8: Why do images need to be resized to 224x224 before they can be fed to AlexNet? How can this affect the results?
Now, we will replace the last max pooling layer of our network with a GeM layer and see how this affects the results. For this model, we remove all fully connected layers (the classifier layers) and replace the last max pooling layer with an aggregation pooling layer (more details about this layer in the next subsection).
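GeM stands for 'generalized mean'. The following is a minimal NumPy sketch of what such a pooling layer computes over a C x H x W activation map; in the actual PyTorch layer the exponent p can also be a learned parameter, whereas here it is fixed for illustration:

```python
import numpy as np

def gem_pool(fmap, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling over the spatial dimensions of a
    C x H x W activation map: p = 1 gives average pooling, and the result
    approaches max pooling as p grows."""
    fmap = np.clip(fmap, eps, None)   # activations assumed non-negative (post-ReLU)
    return np.mean(fmap ** p, axis=(1, 2)) ** (1.0 / p)

fmap = np.random.RandomState(0).rand(256, 13, 13)  # toy conv5-like activations
desc = gem_pool(fmap)                               # one value per channel,
print(desc.shape)                                   # whatever the spatial size
```

Note that the output size depends only on the number of channels, so images of any resolution yield a descriptor of the same dimension.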
model_1c = alexnet_GeM() # instantiate the fine-tuned model with a GeM layer instead of max pooling
print(model_1c) # show the network details. Can you identify what has changed?
Compare with the model we had before:
print(model_1b)
We assume again we have used this model to extract features from all images and stored them in the dfeats variable:
dfeats = np.load(models_dict['alexnet-cls-lm-gem']['dataset'])
pd.DataFrame(dfeats)
Question 9: Why does the size of the feature representation change?
Question 10: Why is the size of the feature representation important for an image retrieval task?
Now, let's continue visualizing the top-15 most similar images:
dataset.vis_top(dfeats, q_idx, ap_flag=True)
do_tsne(dfeats, labels, classes, sec='1c')
Question 11: How does the aggregation layer change the t-SNE visualization?
Question 12: Can we see some structure in the clusters of similarly labeled images?
Now, we will replace the base architecture of our network (the backbone) with a ResNet18 architecture.
model_0 = resnet18()         # ResNet18 with its original average pooling
model_1d = resnet18_GeM()    # the same architecture, but with GeM pooling
print(model_0.adpool) # Show how the last layers of the two models are different
print(model_1d.adpool)
Question 13: Why do we replace the average pooling layer of the original ResNet18 architecture with generalized mean pooling?
Question 14: What operation does the layer model_1d.adpool perform?
model_1d.adpool
Now let's do the same as before and visualize the features and top-15 most similar images to our query:
Let's use a different image for testing this time:
q_idx = 411
Now, let's load the Oxford features from the ResNet18 model and visualize the top-15 results for the given query index:
dfeats = np.load(models_dict['resnet18-cls-lm-gem']['dataset'])
dataset.vis_top(dfeats, q_idx, ap_flag=True)
do_tsne(dfeats, labels, classes, sec='1d')
Question 15: How does this model compare with model 1c, which was trained on the same dataset for the same task?
Question 16: How does it compare to the fine-tuned model of 1b?
Now we will investigate the effects of whitening our descriptors and queries. We will not be changing anything in the network.
# We use a PCA learnt on landmarks to whiten the output features of 'resnet18-cls-lm-gem'
dfeats = np.load(models_dict['resnet18-cls-lm-gem-pcaw']['dataset'])
qfeats = np.load(models_dict['resnet18-cls-lm-gem-pcaw']['queries'])
dataset.vis_top(dfeats, q_idx, q_feat=qfeats[q_idx], ap_flag=True)
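PCA whitening itself is straightforward to sketch. Below is a minimal NumPy illustration of how such a transform could be learned on held-out features (e.g. Landmarks descriptors) and applied to database descriptors; it is not the exact code used to produce the precomputed features above:

```python
import numpy as np

def learn_pca_whitening(train_feats, dim=None):
    """Learn a PCA-whitening transform from a set of training descriptors."""
    mu = train_feats.mean(axis=0)
    cov = np.cov(train_feats, rowvar=False)
    eigval, eigvec = np.linalg.eigh(cov)        # ascending eigenvalues
    order = np.argsort(eigval)[::-1]            # largest variance first
    eigval, eigvec = eigval[order], eigvec[:, order]
    if dim is not None:                         # optional dimensionality reduction
        eigval, eigvec = eigval[:dim], eigvec[:, :dim]
    P = eigvec / np.sqrt(eigval + 1e-9)         # rotate, then equalize variances
    return mu, P

def apply_pca_whitening(feats, mu, P):
    """Center, project, whiten, and re-L2-normalize descriptors."""
    white = (feats - mu) @ P
    return white / np.linalg.norm(white, axis=1, keepdims=True)
```

The re-normalization at the end keeps the whitened descriptors comparable with cosine similarity, as before.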
Visualize the data with t-SNE (excluding unlabeled images)
do_tsne(dfeats, labels, classes, sec='1e-1')
Visualize the data with t-SNE (including unlabeled images)
do_tsne(dfeats, labels, classes, sec='1e-2', show_unlabeled=True)
Question 17: What can we say about the separation of the data when the unlabeled images are included?
Question 18: And about the distribution of the unlabeled features?
Question 19: How could we train a model to separate labeled from unlabeled data?
Now we train the architecture presented in item e) end-to-end for the retrieval task. The architecture includes an FC layer that replaces the PCA projection.
dataset.vis_triplets(nplots=5)
# will print 5 examples of triplets (tuples with a query, a positive, and a negative)
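The ranking objective behind these triplets can be sketched as a margin-based triplet loss: the query should be closer to the positive than to the negative by at least some margin. This is a minimal NumPy sketch; the margin value and the squared-distance formulation are illustrative assumptions, not necessarily the exact loss used to train this model:

```python
import numpy as np

def triplet_loss(q, pos, neg, margin=0.1):
    """Triplet (ranking) loss on L2-normalized descriptors: penalize triplets
    where the negative is not at least `margin` farther from the query than
    the positive is."""
    d_pos = np.sum((q - pos) ** 2)   # squared distance query-positive
    d_neg = np.sum((q - neg) ** 2)   # squared distance query-negative
    return max(0.0, d_pos - d_neg + margin)
```

A satisfied triplet contributes zero loss, so training focuses on the triplets that still violate the ranking constraint.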
Now, let's visualize the top results as before:
# load Oxford features from ResNet18 model trained with triplet loss
dfeats = np.load(models_dict['resnet18-rnk-lm-gem']['dataset'])
qfeats = np.load(models_dict['resnet18-rnk-lm-gem']['queries'])
dataset.vis_top(dfeats, q_idx, q_feat=qfeats[q_idx], ap_flag=True)
Visualize the data with t-SNE (excluding unlabeled images)
do_tsne(dfeats, labels, classes, sec='1f-1')
Visualize the data with t-SNE (including unlabeled images)
do_tsne(dfeats, labels, classes, sec='1f-2', show_unlabeled=True)
Question 20: Compare the plots with unlabeled data for the model trained for retrieval (with the triplet loss) and for the model trained for classification in the previous subsection. How do they change?
Let's now check the effects of adding data augmentation techniques to the training. We will now compare models that have been trained with and without data augmentation.
We will load features from a model trained with the following data augmentations: cropping, pixel jittering, rotation, and tilting. This means the model has been trained with the original images as well as their transformed versions. Note that not every transformation is useful for every class or image, but we cannot know a priori how the pictures were taken or what characterizes each individual class.
For example, cropping is useful when the landmark of interest is usually not found at the center of the image (e.g. selfies taken in front of the Eiffel Tower).
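As an illustration, some of these transformations can be mimicked directly on a raw image array. This is a toy NumPy sketch; a real training pipeline would use a proper image library and small random rotation angles rather than 90-degree turns:

```python
import numpy as np

rng = np.random.RandomState(0)
img = rng.rand(224, 224, 3)      # stand-in for a training image with values in [0, 1]

def random_crop(im, size=200):
    """Random crop: helps when the landmark is not centered in the frame."""
    h, w = im.shape[:2]
    top = rng.randint(0, h - size + 1)
    left = rng.randint(0, w - size + 1)
    return im[top:top + size, left:left + size]

def pixel_jitter(im, scale=0.05):
    """Pixel jittering: small random perturbations of the pixel values."""
    return np.clip(im + rng.uniform(-scale, scale, im.shape), 0.0, 1.0)

def rotate90(im):
    """Coarse rotation stand-in (real pipelines use small random angles)."""
    return np.rot90(im)

# the model sees the original image plus its transformed versions
augmented = [img, random_crop(img), pixel_jitter(img), rotate90(img)]
```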
Another standard practice, besides data augmentation, is to consider the same picture at different resolutions. There are multiple ways to combine the features extracted from those resized images, such as average pooling or spatial pyramids.
Using a model trained with data augmentation, we now extract features at 4 different resolutions and average the outputs.
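Multi-resolution averaging can be sketched as follows, assuming a hypothetical `extract_fn` that maps a (grayscale, for simplicity) image to an L2-normalized descriptor. The nearest-neighbour resize below is only a crude stand-in for proper image resizing, and the scale set is illustrative:

```python
import numpy as np

def multires_descriptor(img, extract_fn, scales=(1.0, 0.75, 0.5, 0.25)):
    """Average the descriptors extracted from the same image at several
    resolutions, then re-L2-normalize the result."""
    descs = []
    for s in scales:
        h = max(1, int(img.shape[0] * s))
        w = max(1, int(img.shape[1] * s))
        # crude nearest-neighbour resize -- enough for this sketch
        rows = np.clip((np.arange(h) / s).astype(int), 0, img.shape[0] - 1)
        cols = np.clip((np.arange(w) / s).astype(int), 0, img.shape[1] - 1)
        descs.append(extract_fn(img[np.ix_(rows, cols)]))
    desc = np.mean(descs, axis=0)
    return desc / np.linalg.norm(desc)   # re-normalize the averaged descriptor
```

This only works cleanly because pooling layers like GeM produce a fixed-size descriptor regardless of the input resolution.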
Let's visualize the top results just like before:
dfeats = np.load(models_dict['resnet18-rnk-lm-gem-da-mr']['dataset'])
qfeats = np.load(models_dict['resnet18-rnk-lm-gem-da-mr']['queries'])
dataset.vis_top(dfeats, q_idx, q_feat=qfeats[q_idx], ap_flag=True)
Visualize the data with t-SNE (excluding unlabeled images)
do_tsne(dfeats, labels, classes, sec='1g')
Question 21: What is the difference in AP between a model trained with data augmentation and one trained without it?
Question 22: What about the clustering? Why do you believe some of the classes have not been adequately clustered yet?
Question 23: What other data augmentation or pooling techniques would you suggest to improve the results? Why?
Finally, we will now upgrade the backbone architecture to Resnet50.
dfeats = np.load(models_dict['resnet50-rnk-lm-gem-da-mr']['dataset'])
qfeats = np.load(models_dict['resnet50-rnk-lm-gem-da-mr']['queries'])
dataset.vis_top(dfeats, q_idx, q_feat=qfeats[q_idx], ap_flag=True)
Visualize the data with t-SNE (excluding unlabeled images)
do_tsne(dfeats, labels, classes, sec='1h')
Question 24: Why does using a larger architecture result in a higher AP? Will this always be the case?